Computing Prosodic Morphology 






George Anton Kiraz* 

University of Cambridge (St John's College) 
Computer Laboratory 

Pembroke Street 
Cambridge CB2 1TP 
George.Kiraz@cl.cam.ac.uk



Abstract 

This paper establishes a framework un- 
der which various aspects of prosodic 
morphology, such as templatic morphol- 
ogy and infixation, can be handled under 
two-level theory using an implemented 
multi-tape two-level model. The paper 
provides a new computational analysis of 
root-and-pattern morphology based on 
prosody. 



1 Introduction 



Prosodic Morphology (McCarthy and Prince, 1986, et seq.) provides adequate means for describing non-linear phenomena such as infixation, reduplication and templatic morphology. Standard two-level systems proved to be cumbersome in describing such operations; see (Sproat, 1992, p. 159 ff.) for a discussion. Multi-tape two-level morphology (Kay, 1987; Kiraz, 1994, et seq.) addresses various issues in the domain of non-linear morphology: it has been used in analysing root-and-pattern morphology (Kiraz, 1994), the Arabic broken plural phenomenon (Kiraz, 1996a), and error detection in non-concatenative strings (Bowden and Kiraz, 1995). The purpose of this paper is to demonstrate how non-linear operations which are motivated by prosody can also be described within this framework, drawing examples from Arabic.

The analysis of Arabic presented here differs from earlier computational accounts in that it employs new linguistic descriptions of Arabic morphology, viz. moraic and affixational theories (McCarthy and Prince, 1990b; McCarthy, 1993). The former argues that a different vocabulary is needed to represent the pattern morpheme according to the Prosodic Morphology Hypothesis (see §1.1), contrary to the earlier CV model where templates are represented as sequences of Cs (consonants) and Vs (vowels). The latter departs radically from the notion of root-and-pattern morphology in the description of the Arabic verbal stem (see §2).

The choice of the linguistic model depends on 
the application in question and is left for the gram- 
marian. The purpose here is to demonstrate that 
multi-tape two-level morphology is adequate for 
representing these various linguistic models. 

The following convention has been adopted. 
Morphemes are represented in braces, { }, and 
surface forms in solidi, / /. In listings of gram- 
mars and lexica, variables begin with a capital 
letter. 

The structure of the paper is as follows: section 2 demonstrates how Arabic templatic morphology can be analysed in prosodic terms, and section 3 looks into infixation; finally, section 4 provides some concluding remarks. The rest of this section introduces prosodic morphology and establishes the computational framework behind this presentation.

1.1 Prosodic Morphology 

There are three essential principles in prosodic morphology (McCarthy and Prince, 1990a; McCarthy and Prince, 1993). They are:



(1) a. Prosodic Morphology Hypothesis. Templates are defined in terms of the authentic units of prosody: mora (μ), syllable (σ), foot (Ft), prosodic word (PrWd).

    b. Template Satisfaction Condition. Satisfaction of template constraints is obligatory and is determined by the principles of prosody, both universal and language-specific.

    c. Prosodic Circumscription. The domain to which morphological operations apply may be circumscribed by prosodic criteria as well as by the more familiar morphological ones.

In the Prosodic Morphology Hypothesis, the mora is the unit of syllabic weight; a monomoraic syllable, σ_μ, is light (L), and a bimoraic syllable, σ_μμ, is heavy (H). The most common types of syllables are: open light, CV, open heavy, CVV, and closed heavy, CVC. This typology is represented graphically in (2).



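(2) [Figure: syllable trees for the three syllable types: σ_μ over CV, σ_μμ over CVV, and σ_μμ over CVC]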



Association of Cs and Vs to templates is based on the Template Satisfaction Condition. Association takes the following form: a node σ always takes a C, and a mora μ takes a V; however, in bimoraic syllables, the second μ may be associated to either a C or a V.¹

Prosodic Circumscription (PC) defines the domain of morphological operations. Normally, the domain of a typical morphological operation is a grammatical category (root, stem or word), resulting in prefixation or suffixation. Under PC, however, the domain of a morphological operation is a prosodically-delimited substring within a grammatical category, often resulting in some sort of infixation. Essential to PC is a parsing function Φ of the form in (3).

(3) Parsing Function
    Φ(C, E)

Let B be a base (i.e. stem or word). The function Φ returns the constituent C that sits on the edge E ∈ {right, left} of the base B. The result is a factoring of B into a kernel, designated by B:Φ, which is the string returned by the parsing function, and a residue, designated by B/Φ, which is the remainder of B. The relation between B:Φ and B/Φ is given in (4), where ⌢ is the concatenation operator.

(4) Factoring of B by Φ
    B = B:Φ ⌢ B/Φ

To illustrate this, let B = /katab/; applying the function Φ(σ_μ, Left) on B factors it into: (i) the kernel B:Φ = /ka/, and (ii) the residue B/Φ = /tab/.
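The factoring of /katab/ can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the system described in this paper: the function name `parse_light_syllable`, the three-vowel inventory, and the simple shape test (a consonant followed by a single vowel) are all invented for the example, and no full syllabification is attempted.

```python
VOWELS = set("aiu")  # assumed vowel inventory, for illustration only


def parse_light_syllable(base, edge="left"):
    """Sketch of a parsing function for a light syllable on a given edge.

    Returns (kernel, residue), or None when no CV sequence sits on that edge.
    """
    if len(base) < 2:
        return None
    if edge == "left":
        kernel, residue = base[:2], base[2:]
        # A following vowel would make the syllable CVV (heavy), not light.
        if residue and residue[0] in VOWELS:
            return None
    elif edge == "right":
        kernel, residue = base[-2:], base[:-2]
    else:
        raise ValueError("edge must be 'left' or 'right'")
    if kernel[0] not in VOWELS and kernel[1] in VOWELS:
        return kernel, residue
    return None


kernel, residue = parse_light_syllable("katab")
print(kernel, residue)  # ka tab
```

The pair returned for the left edge satisfies the factoring in (4): concatenating kernel and residue gives back the base.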

A morphological operation O (e.g. O = "Prefix {t}") defined on a base B is denoted by O(B). There are two types of PC: positive (PPC) and negative (NPC). In PPC, the domain of the operation is the kernel B:Φ; this type is denoted by O:Φ and is defined in (5a). In NPC, the domain is the residue B/Φ; this type is denoted by O/Φ and is defined in (5b).

(5) Definition of PPC and NPC

    a. PPC, O:Φ(B) = O(B:Φ) ⌢ B/Φ

    b. NPC, O/Φ(B) = B:Φ ⌢ O(B/Φ)

In other words, in PPC, O applies to the kernel B:Φ, concatenating the result with the residue B/Φ; in NPC, O applies to the residue B/Φ, concatenating the result with the kernel B:Φ. Examples are provided in section 3.
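The two definitions can be read as higher-order functions. The sketch below is only an illustration: the names `factor`, `ppc`, `npc` and `prefix_t` are hypothetical, the parsing function is replaced by a fixed kernel length on the left edge, and the output strings show the mechanics of the operators rather than attested Arabic forms.

```python
def factor(base, kernel_len):
    """Split base into (kernel, residue); kernel_len stands in for the parsing function."""
    return base[:kernel_len], base[kernel_len:]


def ppc(operation, base, kernel_len):
    """Positive circumscription: apply the operation to the kernel."""
    kernel, residue = factor(base, kernel_len)
    return operation(kernel) + residue


def npc(operation, base, kernel_len):
    """Negative circumscription: apply the operation to the residue."""
    kernel, residue = factor(base, kernel_len)
    return kernel + operation(residue)


def prefix_t(string):
    """The example operation O = 'Prefix {t}'."""
    return "t" + string


# Kernel /ka/ and residue /tab/, as in the factoring of /katab/.
print(ppc(prefix_t, "katab", 2))  # tkatab
print(npc(prefix_t, "katab", 2))  # kattab
```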

1.2 Multi-Tape Two-Level Formalism 

Two-level morphology (Koskenniemi, 1983) defines two levels of strings in recognition and synthesis: lexical strings represent morphemes, and surface strings represent surface forms. Two-level rules map the two strings; the rules are compiled into finite state transducers, where lexical strings sit on one tape of the transducers and surface strings on the other.

Multi-tape two-level morphology is an extension to standard two-level morphology, where more than one lexical tape is allowed. The notion of using multiple tapes first appeared in (Kay, 1987). Motivated by Kay's work, (Kiraz, 1994) proposed a multi-tape two-level model. The model adopts the formalism in (6) as reported by (Pulman and Hepple, 1993).

(6)  LLC  -  Lex   -  RLC   {⇒, ⇔}
     LSC  -  Surf  -  RSC



1 Other conventions associate consonant melodies 
left-to-right to the moraic nodes, followed by associ- 
ating vowel melodies to syllable-initial morae. 



where LLC is the left lexical context, Lex is the lexical form, RLC is the right lexical context, LSC is the left surface context, Surf is the surface form, and RSC is the right surface context.

The special symbol * indicates an empty context, which is always satisfied. The operator ⇒ states that Lex may surface as Surf in the given context, while the operator ⇔ adds the condition that when Lex appears in the given context, then the surface description must satisfy Surf. The latter caters for obligatory rules. A lexical string maps to a surface string iff (1) they can be partitioned into pairs of lexical-surface subsequences, where each pair is licenced by a rule, and (2) no partition violates an obligatory rule.



One of the extensions introduced in the multi-tape version is that all expressions in the lexical side of the rules (i.e. LLC, Lex and RLC) are n-tuples of regular expressions of the form (x_1, x_2, ..., x_n). The ith expression refers to symbols on the ith tape. When n = 1, the parentheses can be ignored; hence, (x) and x are equivalent.²

2 Templatic Morphology 

Templatic morphology is best exemplified in Semitic root-and-pattern morphology. This section sets a framework under which templatic morphology can be described using (augmented) two-level theory. Our presentation differs from previous proposals³ in that it employs prosodic morphology in the analysis of Arabic, rather than earlier CV accounts. Arabic verbal forms appear in (7) in the passive (rare forms are not included).



(7) Arabic Verbal Measures (1-8, 10)

    1  kutib        6   tukuutib
    2  kuttib       7   nkutib
    3  kuutib       8   ktutib
    4  ʔuktib       10  stuktib
    5  tukuttib



(McCarthy, 1993) points out that Arabic verbal forms are derived from the base template in (8), which represents Measure 1. σ_x represents an extrametrical consonant; that is, the last consonant in a stem.

(8) Arabic Base Template
    σ_μ  σ_μ  σ_x
    ku   ti   b

The remaining measures are derived from the base template by affixation; they have no templates of their own. The simplest operation is prefixation, e.g. {n} + Measure 1 → /nkutib/ (Measure 7). Measures 4 and 10 are derived in a similar fashion, but undergo a rule of syncope as shown in (9).
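The syncope step can be sketched as a regular-expression rewrite that deletes a vowel standing between two CVC sequences. The sketch rests on assumptions made only for this example: the name `syncope` is hypothetical, a consonant is taken to be any symbol other than a, i or u, and the whole prefixed form is treated as the domain of the rule.

```python
import re

# Delete a vowel between CVC and CVC at the end of the form.
# "Consonant" here means any symbol that is not a, i or u (an assumption).
SYNCOPE = re.compile(r"^(.*[^aiu][aiu][^aiu])[aiu]([^aiu][aiu][^aiu])$")


def syncope(form):
    """Return the form with the medial vowel removed, or unchanged if no match."""
    return SYNCOPE.sub(r"\1\2", form)


print(syncope("ʔukutib"))   # ʔuktib
print(syncope("stukutib"))  # stuktib
print(syncope("nkutib"))    # nkutib (no CVC precedes the vowel, so no change)
```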



2 Our implementation interprets rules directly (see (Kiraz, 1996c)); hence, we allow unequal representation of strings. If the rules were to be compiled into automata, a genuine symbol, e.g. 0, must be introduced by the rule compiler. For the compilation of our formalism into automata, see (Grimley-Evans et al., 1996).

3 Non-linear proposals include (Kornai, 1991), (Wiebe, 1992), (Narayanan and Hashem, 1993), (Bird and Ellison, 1994) and (Kiraz, 1994).



(9) Derivation of Measures 4 and 10
    Syncope: V → ∅ / [CVC __ CVC]stem

    a. Measure 4:  ʔu + kutib → */ʔukutib/ → (syncope) /ʔuktib/

    b. Measure 10: stu + kutib → */stukutib/ → (syncope) /stuktib/

The following lexicon and two-level grammar demonstrate how the above measures can be analysed under two-level theory. The lexicon maintains four tapes: pattern, root, vocalism and affix tapes.

    1   {σ_μ σ_μ σ_x}   pattern:    [measure=(1-8,10)]
    2   {ktb}           root:       [measure=(1-4,6-8,10)]
    3   {ui}            vocalism:   [tense=perf, voice=pass]
    4   {ʔV}            verb_affix: [measure=4]
    4   {n}             verb_affix: [measure=7]
    4   {stV}           verb_affix: [measure=10]

The first column indicates the tape on which the morpheme sits, and the second column gives the morpheme. Each lexical entry is associated with a category and a feature structure of the form cat:FS (column 3). Feature values in parentheses are disjunctive and are implemented using boolean vectors (Mellish, 1988; Pulman, 1991).
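One way to picture disjunctive values as boolean vectors is to give each measure one bit and to unify two values by intersecting their bits. The following is a minimal sketch of that idea only; the names `encode`, `decode` and `unify` and the bit layout are assumptions for illustration and are not claimed to match the encoding of the cited works or of the implementation.

```python
MEASURES = tuple(range(1, 11))  # measures 1..10, one bit each (assumed layout)


def encode(values):
    """Turn a disjunction of measures into an integer bit vector."""
    vector = 0
    for value in values:
        vector |= 1 << MEASURES.index(value)
    return vector


def decode(vector):
    """List the measures whose bits are set."""
    return [m for i, m in enumerate(MEASURES) if (vector >> i) & 1]


def unify(left, right):
    """Unify two disjunctive values by bitwise AND; None signals failure."""
    result = left & right
    return result if result else None


pattern = encode([1, 2, 3, 4, 5, 6, 7, 8, 10])  # measure=(1-8,10)
root = encode([1, 2, 3, 4, 6, 7, 8, 10])        # measure=(1-4,6-8,10)
affix = encode([4])                             # measure=4

print(decode(unify(unify(pattern, root), affix)))  # [4]
print(unify(root, encode([5])))                    # None
```

Under this encoding the root {ktb} fails to unify with anything restricted to Measure 5, which mirrors its lexical marking.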

{σ_μ σ_μ σ_x} is the base-template, {ktb} 'notion of writing' is the root; it may occur in all measures apart from Measure 5.⁴ {ui} is the perfective passive vocalism. The remaining morphemes represent the affixes for Measures 4, 7 and 10. Notice that the vowel in the affixes of Measures 4 and 10 is a variable V. This makes it possible for the affix to have a different vowel according to the mood of the following stem, e.g. [a] in /ʔaktab/ (Measure 4, active) and [u] in /ʔuktib/ (Measure 4, passive).

Since the lexicon declares 4 lexical tapes, each 
lexical expression in the two-level grammar must 
be at most a 4-tuple. A grammar for the deriva- 
tion of the cited data appears below. 

    R1   *          -  (σ_μ,C,V,ε)  -  *   ⇒
         *          -  CV           -  *

    R2   *          -  (σ_x,C,ε,ε)  -  *   ⇒
         *          -  C            -  *

    R3   *          -  (ε,ε,ε,A)    -  *   ⇒
         *          -  A            -  *

    R4   (X,*,ε,ε)  -  (+,+,+,ε)    -  *   ⇒
         *          -  ε            -  *

    R5   (ε,ε,ε,A)  -  (ε,ε,ε,+)    -  *   ⇒
         *          -  ε            -  *



A working system for Arabic is reported by (Beesley et al., 1989; Beesley, 1990; Beesley, 1991).



4 Roots do not occur in all measures in the litera- 
ture. Each root is lexically marked with the measures 
it occurs in. 



    R6   *      -  (σ_μ,C,V,ε)  -  *             ⇔
         C_1 V  -  C            -  C_2 V_1 C_3

where C_i = radical, V_i = vowel, A = verbal affix, and X ≠ +.

Rule R1 handles monomoraic syllables, mapping (σ_μ,C,V,ε) on the lexical tapes to CV on the surface tape. Rule R2 maps the extrametrical consonant in a stem (i.e. the last consonant in a stem) to the surface. Rule R3 maps an affix symbol from the fourth tape to the surface. Rules R4 and R5 delete the boundary symbols from stems and affixes, respectively. Finally, rule R6 simulates the syncope rule in (9); note that V in LSC must unify with V in Lex, ensuring that the vowel of the affix has the same quality as that of the stem, e.g. /ʔaktab/ and /ʔu+ktib/ (Measure 4).

The two-level analysis of the cited forms ap- 
pears below - ST = surface tape, PT = pattern 
tape, RT = root tape, VT = vocalism tape, and 
AT = affix tape. 



Measure 1

    AT
    VT    u     i           +
    RT    k     t     b     +
    PT    σ_μ   σ_μ   σ_x   +
          1     1     2     4
    ST    ku    ti    b

Measure 4

    AT    ʔ     u     +
    VT                      u     i           +
    RT                      k     t     b     +
    PT                      σ_μ   σ_μ   σ_x   +
          3     3     5     6     1     2     4
    ST    ʔ     u           k     ti    b

Measure 7

    AT    n     +
    VT                u     i           +
    RT                k     t     b     +
    PT                σ_μ   σ_μ   σ_x   +
          3     5     1     1     2     4
    ST    n           ku    ti    b

Measure 10

    AT    s     t     u     +
    VT                            u     i           +
    RT                            k     t     b     +
    PT                            σ_μ   σ_μ   σ_x   +
          3     3     3     5     6     1     2     4
    ST    s     t     u           k     ti    b


The numbers between the two levels indicate the numbers of the rules (R1-R6) which sanction the sequences. The remaining Measures involve infixation and are discussed in the next section.
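The way the four lexical tapes are consumed in step can be sketched as follows. This is a simplified illustration, not the SemHe interpreter: the names `apply_rule` and `synthesize` are hypothetical, each lexical 4-tuple is given in the order (pattern, root, vocalism, affix), the empty string stands for ε, and only the context-free core of rules R1-R5 is modelled (the contexts of R4 and R5 and the syncope rule R6 are left out).

```python
EPS = ""  # stands for the empty symbol ε


def apply_rule(pattern, root, vocalism, affix):
    """Return (rule number, surface string) for one lexical 4-tuple."""
    if pattern == "σμ" and root and vocalism and affix == EPS:
        return 1, root + vocalism            # R1: monomoraic syllable -> CV
    if pattern == "σx" and root and vocalism == EPS and affix == EPS:
        return 2, root                       # R2: extrametrical consonant -> C
    if (pattern, root, vocalism) == (EPS, EPS, EPS) and affix not in (EPS, "+"):
        return 3, affix                      # R3: affix symbol -> surface
    if (pattern, root, vocalism, affix) == ("+", "+", "+", EPS):
        return 4, EPS                        # R4: stem boundary deleted
    if (pattern, root, vocalism, affix) == (EPS, EPS, EPS, "+"):
        return 5, EPS                        # R5: affix boundary deleted
    raise ValueError("no rule sanctions this tuple")


def synthesize(tuples):
    """Map a sequence of lexical tuples to (rule numbers, surface form)."""
    rules, surface = [], []
    for lexical in tuples:
        number, segment = apply_rule(*lexical)
        rules.append(number)
        surface.append(segment)
    return rules, "".join(surface)


STEM = [("σμ", "k", "u", EPS), ("σμ", "t", "i", EPS),
        ("σx", "b", EPS, EPS), ("+", "+", "+", EPS)]
PREFIX_N = [(EPS, EPS, EPS, "n"), (EPS, EPS, EPS, "+")]

print(synthesize(STEM))             # ([1, 1, 2, 4], 'kutib')
print(synthesize(PREFIX_N + STEM))  # ([3, 5, 1, 1, 2, 4], 'nkutib')
```

The rule sequences returned for Measures 1 and 7 agree with the numbers shown between the lexical and surface levels in the analyses above.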

3 Infixation 

Standard two-level models can describe some classes of infixation, but only by resorting to ad hoc diacritics which have no linguistic significance, e.g. (Antworth, 1990, p. 156). This section presents a framework for describing infixation rules using our multi-tape two-level formalism. This is illustrated here by analysing Measures 2 and 8 of the Arabic verb. Measure 2, /kuttib/, is derived by prefixing a mora to the base template under NPC. The operation is O = 'prefix μ' and the rule is O/Φ(σ_μ, Left). The new mora is filled by the spreading of the adjacent (second) consonant. The steps of the derivation are:



    O/Φ(kutib) = kutib:Φ ⌢ O(kutib/Φ)
               = ku ⌢ O(tib)
               = ku ⌢ μtib
               = ku ⌢ ttib
               = kuttib

Measure 8, /ktutib/, is derived by the affixation of a {t} to the base template under NPC. The operation is O = 'prefix {t}'; the rule is O/Φ(C, Left), where C is a consonant. The process is:



    O/Φ(kutib) = kutib:Φ ⌢ O(kutib/Φ)
               = k ⌢ O(utib)
               = k ⌢ tutib
               = ktutib
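Both derivations follow the same negative-circumscription pattern and differ only in the kernel and in the operation applied to the residue. The sketch below makes this explicit; the names `npc`, `prefix_mora` and `prefix_t` are hypothetical, the kernel is given as a length rather than computed by a parsing function, and filling the new mora is modelled simply as copying the first consonant of the residue.

```python
def npc(operation, base, kernel_len):
    """Apply the operation to the residue and reattach the kernel on the left."""
    kernel, residue = base[:kernel_len], base[kernel_len:]
    return kernel + operation(residue)


def prefix_mora(residue):
    """Prefix a mora and fill it by spreading the adjacent consonant."""
    return residue[0] + residue


def prefix_t(residue):
    """Prefix the affix {t}."""
    return "t" + residue


measure_2 = npc(prefix_mora, "kutib", 2)  # kernel = light syllable /ku/
measure_8 = npc(prefix_t, "kutib", 1)     # kernel = first consonant /k/

print(measure_2)  # kuttib
print(measure_8)  # ktutib
```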



The following two-level grammar builds on the one discussed in section 2. The following lexical entry gives the Measure 8 morpheme.

    4   {t}   verb_affix: [measure=8]



The additional two-level rules are:

    R7   (σ_μ,C_1,V_1,ε)  -  ε  -  (σ_μ,C,*,ε)   ⇒
         *                -  C  -  *
         Features: [measure=(2,5)]

    R8   *  -  (σ_μ,C,V,A)  -  *   ⇒
         *  -  CAV          -  *
         Features: [measure=8]

where C_i = radical, V_i = vowel, A = verbal affix, and X ≠ +.

Rules R7-R8 are measure-specific. Each rule 
is associated with a feature structure which must 
unify with the feature structures of the affected 
lexical entries. This ensures that each rule is ap- 
plied only to the proper measure. 

R7 handles Measure 2; it represents the operation O = 'prefix μ' and the rule O/Φ(σ_μ, Left) by placing B:Φ in LLC and the residue B/Φ in RLC, and inserting a consonant C (representing μ) on the surface. The filling of μ by the spreading of the second radical is achieved by the unification of C in Surf with C in RLC.

R8 takes care of Measure 8; it represents the operation O = 'prefix {t}' and the rule O/Φ(C, Left). Note that one cannot place B:Φ and B/Φ in LLC and RLC, respectively, as is the case in R7, because the parsing function cuts into the first syllable.

One remaining Measure has not been discussed, Measure 3. It is derived by prefixing the base template with μ. The process is as follows:

    μ + σ_μ σ_μ σ_x   →   σ_μμ σ_μ σ_x   →   /kuutib/
        ku  ti  b         kuu  ti  b

The corresponding two-level rule follows. It adds a μ by lengthening the vowel V into VV.

    R9   *  -  (σ_μ,C,V,ε)  -  *   ⇒
         *  -  CVV          -  *
         Features: [measure=(3,6)]

The two-level derivations are:

Measure 2

    AT
    VT    u           i           +
    RT    k           t     b     +
    PT    σ_μ         σ_μ   σ_x   +
          1     7     1     2     4
    ST    ku    t     ti    b

Measure 3

    AT
    VT    u     i           +
    RT    k     t     b     +
    PT    σ_μ   σ_μ   σ_x   +
          9     1     2     4
    ST    kuu   ti    b

Measure 8

    AT    t     +
    VT    u           i           +
    RT    k           t     b     +
    PT    σ_μ         σ_μ   σ_x   +
          8     5     1     2     4
    ST    ktu         ti    b




Finally, Measures 5 and 6 are derived by prefix- 
ing {tu} to Measures 2 and 3, respectively. 
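The combined effect of the measure-specific rules on the first syllable can be summarised in a short sketch. It is an illustration only: the function `passive_stem` is hypothetical, it hard-codes the choices that the grammar makes through feature unification, and it covers only Measures 1, 2, 3, 5, 6 and 8 (the prefixing Measures 4, 7 and 10 are left out).

```python
def passive_stem(root, vocalism, measure, infix=""):
    """Build the surface form for a triliteral root and a two-vowel vocalism."""
    c1, c2, c3 = root
    v1, v2 = vocalism
    if measure == 8:
        first = c1 + infix + v1        # as in R8: CAV
    elif measure in (3, 6):
        first = c1 + v1 + v1           # as in R9: CVV
    else:
        first = c1 + v1                # as in R1: CV
    spread = c2 if measure in (2, 5) else ""     # as in R7: copy the second radical
    prefix = "tu" if measure in (5, 6) else ""   # {tu} for Measures 5 and 6
    # The second syllable and the final consonant correspond to R1 and R2.
    return prefix + first + spread + c2 + v2 + c3


for measure in (1, 2, 3, 5, 6):
    print(measure, passive_stem("ktb", "ui", measure))
print(8, passive_stem("ktb", "ui", 8, infix="t"))
```

With the root {ktb} and the vocalism {ui} this yields kutib, kuttib, kuutib, tukuttib, tukuutib and ktutib, matching the forms listed in (7).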

4 Conclusion 

This paper has demonstrated that multi-tape two-level systems offer richer and more powerful devices than those in standard two-level models. This makes the multi-tape version capable of modelling non-linear operations such as infixation and templatic morphology.

The rules and lexica samples reproduced here are based on a larger morphological grammar written for the SemHe implementation (a multi-tape two-level system); for a full description of the system, see (Kiraz, 1996c; Kiraz, 1996b).